Simple Compound Splitting for German
نویسنده
چکیده
INTRODUCTION • Compound: concatention of two or more words Apfel|baum (apple tree) Apfel|kuchen|rezept|sammlung (apple cake recipe collection) • Productive word formation process → infinite amount of possible compounds • Compound splitting useful for many NLP applications – Statistical Machine translation: translation of new compounds, better lexical coverage – Information retrieval: better generalization
منابع مشابه
Chasing the Perfect Splitter: A Comparison of Different Compound Splitting Tools
This paper reports on the evaluation of two compound splitters for German. Compounding is a very frequent phenomenon in German and thus efficient ways of detecting and correctly splitting compound words are needed for natural language processing applications. This paper presents different strategies for compound splitting, focusing on German. Four compound splitters for German are presented. Tw...
متن کاملStatistical Machine Translation of German Compound Words
German compound words pose special problems to statistical machine translation systems: the occurence of each of the components in the training data is not sufficient for successful translation. Even if the compound itself has been seen during training, the system may not be capable of translating it properly into two or more words. If German is the target language, the system might generate on...
متن کاملGerman Compounds in Factored Statistical Machine Translation
An empirical method for splitting German compounds is explored by varying it in a number of ways to investigate the consequences for factored statistical machine translation between English and German in both directions. Compound splitting is incorporated into translation in a preprocessing step, performed on training data and on German translation input. For translation into German, compounds ...
متن کاملHow to Avoid Burning Ducks: How to Avoid Burning Ducks: Combining Linguistic Analysis and Corpus Statistics for German Compound Processing
Compound splitting is an important problem in many NLP applications which must be solved in order to address issues of data sparsity. Previous work has shown that linguistic approaches for German compound splitting produce a correct splitting more often, but corpus-driven approaches work best for phrase-based statistical machine translation from German to English, a worrisome contradiction. We ...
متن کاملDetermining Immediate Constituents of Compounds in GermaNet
In order to be able to systematically link compounds in GermaNet to their constituent parts, compound splitting needs to be applied recursively and has to identify the immediate constituents at each level of analysis. Existing tools for compound splitting for German only offer an analysis of all component parts of a compound at once without any grouping of subconstituents. Thus, existing tools ...
متن کامل